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Abstract 

The process of dynamic state estimation (filtering) based on point process ob¬ 
servations is in general intractable. Numerical sampling techniques are often practi¬ 
cally useful, but lead to limited conceptual insight about optimal encoding/decoding 
strategies, which are of significant relevance to Computational Neuroscience. We de¬ 
velop an analytically tractable Bayesian approximation to optimal filtering based on 
point process observations, which allows us to introduce distributional assumptions 
about sensory cell properties, that greatly facilitates the analysis of optimal encoding 
in situations deviating from common assumptions of uniform coding. The analytic 
framework leads to insights which are difficult to obtain from numerical algorithms, 
and is consistent with experiments about the distribution of tuning curve centers. 
Interestingly, we find that the information gained from the absence of spikes may be 
crucial to performance. 


1 Introduction 

The task of inferring a hidden dynamic state based on partial noisy observations plays 
an important role within both applied and natural domains. A widely studied problem 
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is that of online inference of the hidden state at a given time based on observations up 
to to that time, referred to as filtering [T]. For the linear setting with Gaussian noise 
and quadratic cost, the solution is well known since the early 1960s both for discrete and 
continuous times, leading to the celebrated Kalman and the Kalman-Bucy filters 12 E], 
respectively. In these cases the exact posterior distribution is Gaussian, resulting in closed 
form recursive update equations for the mean and variance of this distribution, implying 
finite-dimensional filters. However, beyond some very specific settings [1], the optimal 
filter is infinite-dimensional and impossible to compute in closed form, requiring either 
approximate analytic techniques (e.g., the extended Kalman filter (e.g., DP) , the unscented 
filter m or numerical procedures (e.g., particle filters i)- The latter usually require time 
discretization and a finite number of particles, resulting in loss of precision . For many 
practical tasks (e.g., queuing [2 and optical communication i) and biologically motivated 
problems (e.g., 0) a natural observation process is given by a point process observer, 
leading to a nonlinear infinite-dimensional optimal filter (except in specific settings, e.g., 
finite state spaces, lanni). 

We consider a continuous-state and continuous-time multivariate hidden Markov pro¬ 
cess observed through a set of sensory neuron-like elements characterized by multi¬ 
dimensional tuning functions, representing the elements’ average firing rate. The tuning 
function parameters are characterized by a distribution allowing much flexibility. The ac¬ 
tual firing of each cell is random and is given by a Poisson process with rate determined 
by the input and by the cell’s tuning function. Inferring the hidden state under such cir¬ 
cumstances has been widely studied within the Computational Neuroscience literature, 
mostly for static stimuli. In the more challenging and practically important dynamic set¬ 
ting, much work has been devoted to the development of numerical sampling techniques 
for fast and effective approximation of the posterior distribution (e.g., m)- In this work 
we are less concerned with algorithmic issues, and more with establishing closed-form 
analytic expressions for an approximately optimal filter (see |I0[ II2L113) for previous work 
in related, but more restrictive settings), and using these to characterize the nature of 
near-optimal encoders, namely determining the structure of the tuning functions for op¬ 
timal state inference. A significant advantage of the closed form expressions over purely 
numerical techniques is the insight and intuition that is gained from them about quali¬ 
tative aspects of the system. Moreover, the leverage gained by the analytic computation 
contributes to reducing the variance inherent to Monte Garlo approaches. Technically, 
given the intractable infinite-dimensional nature of the posterior distribution, we use a 
projection method replacing the full posterior at each point in time by a projection onto 
a simple family of distributions (Gaussian in our case). This approach, originally devel¬ 
oped in the Filtering literature [HI [IS], and termed Assumed Density Filtering (ADF), 
has been successfully used more recently in Machine Learning imiiii. As far as we are 
aware, this is the first application of this methodology to point process filtering. 

The main contributions of the paper are the following: (i) Derivation of closed form 
recursive expressions for the continuous time posterior mean and variance within the ADF 
approximation, allowing for the incorporation of distributional assumptions over sensory 
variables, (ii) Gharacterization of the optimal tuning curves (encoders) for sensory cells in 
a more general setting than hitherto considered. Specifically, we study the optimal shift 
of tuning curve centers, providing an explanation for observed experimental phenomena 
|I8j . (iii) Demonstration that absence of spikes is informative, and that, depending on the 
relationship between the tuning curve distribution and the dynamic process (the ‘prior’). 
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may significantly improve the inference. This issue has not been emphasized in previous 
studies focusing on homogeneous populations. 

We note that most previous work in the field of neural encoding/decoding has dealt 
with static observations and was based on the Fisher information, which often leads to 
misleading qualitative results (e.g., [HIEQ]). Our results address the full dynamic setting 
in continuous time, and provide results for the posterior variance, which is shown to 
yield an excellent approximation of the posterior Mean Square Error (MSE). Previous 
work addressing non-uniform distributions over tuning curve parameters |21j used static 
univariate observations and was based on Fisher information rather than the MSE itself. 


2 Problem formulation 


2.1 Dense Gaussian neural code 

We consider a dynamical system with state Xt € K", observed through an observation 
process N describing the firing patterns of sensory neurons in response to the process X. 
The observed process is a diffusion process obeying the Stochastic Differential Equation 
(SDE) 

dXt = A{Xt)dt +D{Xt)dWu (f>0) 

where A{-),D{-) are arbitrary functions and Wt is standard Brownian motion. The 
initial condition Xq is assumed to have a continuous distribution with a known density. 
The observation process TV is a marked point process [S] defined on [0, oo) x K™, meaning 
that each point, representing the firing of a neuron, is identified by its time t € [0, oo), 
and a mark 6 € K™. In this work the mark is interpreted as a parameter of the firing 
neuron, which we refer to as the neuron’s preferred stimulus. Specifically, a neuron with 
parameter 9 is taken to have firing rate 

X{x-,9) = hexp \\Hx - > 


in response to state x, where H G and Etc G , m < n, are fixed matrices, and 

the notation ||y||^ denotes y"'"My. The choice of Gaussian form for A facilitates analytic 
tractability. The inclusion of the matrix H allows using high-dimensional models where 
only some dimensions are observed, for example when the full state includes velocities 
but only locations are directly observable. We also define Nt = N ([0,t) x K"*), i.e., is 
the total number of points up to time t, regardless of their location 9, and denote by Aft 
the sequenee of points up to time t — formally, the process N restricted to [0,f) x K™. 
Following [^, we use the notation 


[ [ f{t,9)N{dtxd9)^y^l{UG[a,b],9,GU}f{U,9,)., 

Ja Ju 


( 1 ) 


for U C and any function /, where {ti,9i) are respectively the time and mark of the 
i-th point of the process N. 

Consider a network with M sensory neurons, having random preferred stimuli 9 = 
that are drawn independently from a common distribution with probability den¬ 
sity f{9), which we refer to as the population density. Positing a distribution for the 
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preferred stimuli allows us to obtain simple closed form solutions, and to optimize over 
distribution parameters rather than over the higher-dimensional space of all the 9i. The 
total rate of spikes with preferred stimuli in a set A C K™, given Xt = x, is then 

(x; 9 ) = hJ2i l{6»iGA} exp ^ \\Hx — . Averaging over / (9), we have the ex¬ 
pected rate Aa { x ) = EAa (a;; 9 ) = hM f (9) exp A \\Hx — d9. We now obtain 

an infinite neural network by considering the limit M —>■ oo while holding A° = hM fixed. 

In the limit we have Aa (a:; 9 ) —>• Aa (a;), so that the process N has density 

At (0,W) = AV Wexp \\HX, - , (2) 

meaning that the expected number of points in a small rectangle [t, t + dt] x [9i, 9i + d9i\, 
conditioned on the history A[o_t], is At {9, Xt) dtJli d9i + o {dt, \ d9\). A finite network can 
be obtained as a special case by taking / to be a sum of delta functions representing a 
discrete distribution. 

For analytic tractability, we assume that / {9) is Gaussian with center c and covariance 
Epop, namely / (9) = Af{9; c, Spop). We refer to c as the population center. Previous work 
|231I201[^ considered the case where neurons’ preferred stimuli uniformly cover the space, 
obtained by removing the factor / {9) from ([^. Then, the total firing rate / A* {9,x) d9 
is independent of cc, which simplihes the analysis, and leads to a Gaussian posterior (see 
P5]l. We refer to the assumption that / At (0, x) d9 is independent of x as uniform coding. 

The uniform coding case may be obtained from our model by taking the limit Ep^p 0 
with A^/i^det Epop held constant. 

2.2 Optimal encoding and decoding 

We consider the question of optimal encoding and decoding under the above model. The 
process of neural decoding is assumed to compute (exactly or approximately) the full 
posterior distribution of Xt given Aft. The problem of neural encoding is then to choose 
the parameters (j) = (c, Epop, Etc), which govern the statistics of the observation process 
N, given a specihc decoding scheme. 

To quantify the performance of the encoding-decoding system, we summarize the re¬ 
sult of decoding using a single estimator Xt = W (M), and define the Mean Square Error 
(MSE) as et = tra.ce[{Xt—Xt){Xt—Xt)'^]. We seek Xt and (f that solve min,^ limt_too niin^^ E [e*] 
min,^ limt_>oo E[minE[et |A/t]]. The inner minimization problem in this equation is solved 
by the MSE-optimal decoder, which is the posterior mean Xt = ht — E[At|A/(]. The 
posterior mean may be computed from the full posterior obtained by decoding. The 
outer minimization problem is solved by the optimal encoder. In principle, the encod¬ 
ing/decoding problem can be solved for any value of t. In order to assess performance it 
is convenient to consider the steady-state limit t > oo for the encoding problem. 

Below, we find a closed form approximate solution to the decoding problem for any t 
using ADF. We then explore the problem of choosing the steady-state optimal encoding 
parameters </> using Monte Garlo simulations. Note that if decoding is exact, the problem 
of optimal encoding becomes that of minimizing the expected posterior variance. 
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3 Neural decoding 


3.1 Exact filtering equations 

Let P {‘tt) denote the posterior density of Xt given A/j, and Ep [•] the posterior expectation 
given Aft- The prior density P {-,0) is assumed to be known. 

The problem of filtering a diffusion process X from a doubly stochastic Poisson process 
driven by X is formally solved in [25]. The result is extended to marked point processes 
in [23], where the authors derive a stochastic PDE for the posterior densitjQ 


dP {x, t) = C*P (x, t)dt + P (x, t) [ At(6>,x) Xt{e) / ^ a ^ 

leeR™ At (6) ^ > 

(3) 

where the integral with respect to N is interpreted as in Q, £ is the state’s pos¬ 
terior infinitesimal generator (Kolmogorov’s backward operator), defined as Cf (x) = 
limAt->.o+ (E [/ {Xt+At) \Xt = x\ — f (x)) /At, C* is £’s adjoint operator (Kolmogorov’s 
forward operator), and At (0) = Ep [At {9, Xt)] = J P (x, t) Xt (0, x) dx. 

The stochastic PDE ([^ is usually intractable. In [23l [24] the authors consider linear 
dynamics with uniform coding and Gaussian prior. In this case, the posterior is Gaussian, 
and (§ leads to closed form ODEs for its moments. When the uniform coding assumption 
is violated, the posterior is no longer Gaussian. Still, we can obtain exact equations for 
the posterior moments, as follows. 

Let Pit = E*pW,At = Xt-pit,^t= E^piXtXn Using (§, along with known results 
about the form of the infinitesimal generator Ct for diffusion processes (see appendix), 
the first two posterior moments can be shown to obey the following equations between 
spikes (see M)- 


dpit 

dt 

dSt 

dt 


E^ [A (At)]+E^ 
A{Xt)Xj 


Xt 


Xt {9)-Xt {e,Xt)]de 


E^ 




XtA{Xt) 


■Pfp 


D{Xt)D{Xty 




Xtxy / (Xt{e)-Xt{e,Xt))d9 


(4) 


3.2 ADF approximation 

While equations Q are exact, they are not practical, since they require computation of 
Ep [•]. We now proceed to find an approximate closed form for Q. Here we present the 
main ideas of the derivation. The formulation presented here assumes, for simplicity, an 
open-loop setting where the system is passively observed. It can be readily extended to a 
closed-loop control-based setting, and is presented in this more general framework in the 
appendix, including full details. 

To bring Q to a closed form, we use ADF with an assumed Gaussian density (see 
[T6] for details). Gonceptually, this may be envisioned as integrating 0 while replacing 

^The model considered in m assumes linear dynamics and uniform coding — meaning that the total 
rate of Nt, namely Xt (6, Xt) dO, is independent of Xt. However, these assumption are only relevant to 
establish other proposition in that paper. The proof of equation 0 still holds as is in our more general 
setting. 
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the distribution P by its approximating Gaussian “at each time step”. Assuming the 
moments are known exactly, the Gaussian is obtained by matching the first two moments 
of P [in]. Note that the solution of the resulting equations does not in general match the 
first two moments of the exact solution, though it may approximate it. 

Abusing notation, in the sequel we use /it, Et to refer to the ADF approximation rather 
than to the exact values. Substituting the normal distribution JV{x; for P{x,t) 

to compute the expectations involving At in <13> and using © and the Gaussian form 
of f{9), results in computable Gaussian integrals. Other terms may also be computed 
in closed form if the function A, D can be expanded as power series. This computation 
yields approximate equations for /it, Et between spikes. The updates at spike times can 
similarly be computed in closed form either from ([^ or directly from a Bayesian update 
of the posterior (see appendix, or e.g., m)- 

For simplicity, we assume that the dynamics are linear, dXt = AXt +DdWt^ resulting 
in the filtering equations 


d^t = A^tdt + gtT,tH^St {H— c) dt + Sl- ( {9 — P[gt-) N [dt x dO) (5) 

j 

dXt = {AT.t + 'S‘tA + DD'^) dt 


+ gt 


dt 


(Hgt - c) {Hgt - cf StHl^t 

-Et-H^SllHJ:t-dNt, 

where 5*“ ^ (E^ + FfEtFf^)”^, St = (E^ + Epop + iFEti?^)”\ and 

<7t = j\{9)d9 = Je^p [X {9, At)] d9 = AVdet (E^c^t) exp ||iJ/it - 


( 6 ) 


is the posterior expected total firing rate. Expressions including t~ are to be interpreted as 
left limits / {t~) = lim 5 _,,t- / (s), which are necessary since the solution is discontinuous 
at spike times. 

The last term in ([^ is to be interpreted as in Q. It contributes an instantaneous jump 
in /it at the time of a spike with preferred stimulus 9, moving Hgt closer to 9. Similarly, 
the last term in (|^ contributes an instantaneous jump in Ef at each spike time, which 
is the same regardless of spike location. All other terms describe the evolution of the 
posterior between spikes: the first few terms in (§-(§ are the same as in the dynamics 
of the prior, as in [I1I21], whereas the terms involving gt correspond to information from 
the absence of spikes. Note that the latter scale with gt, the expected total firing rate, 
i.e., lack of spikes becomes “more informative” the higher the expected rate of spikes. 

It is illustrative to consider these equations in the scalar case m = n = I, with H = 1. 
Letting af = Et,al^ = Etc, cr^op = ^pop, a = A,d = D yields 


dg-t 

dcTt 


agtdt + gt 


2aa; 


O'* + cr£ + a, 


gt~ 


{gt - c)dt + 


pop 


pop 


1 - . 


(gt - c) 


tc J 
2 


{9 - gt-) N {dt X d9) (7) 


erf + crfc + CT, 


pop 


dt — 


(Jt-dNt, 

( 8 ) 
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Figure 1: Left Changes to the posterior moments between spikes as a function of the 
current posterior mean estimate, for a static 1-d state. The parameters are a = d = 
0,H = 1, cTpop = 1, = 0.2, c = 0, A° = 10, at = 1. The bottom plot shows the density 

of preferred stimuli / (9) and tuning curve for a neuron with preferred stimulus 9 = 0. 
Right An example of filtering a linear one-dimensional process. Each dot correspond 
to a spike with the vertical location indicating the preferred stimulus 9. The curves to 
the right of the graph show the preferred stimulus density (black), and a tuning curve 
centered at 9 = 0 (gray). The tuning curve and preferred stimulus density are normalized 
to the same height for visualization. The bottom graph shows the posterior variance, with 
the vertical lines showing spike times. Parameters are: a = —0.1,d = 2,H = l,crpop = 
= 0.2, c = 0, A° = 10,/To = 0,0-0 = 1. Note the decrease of the posterior variance 
following t = 4 even though no spikes are observed. 


where gt = A°-^/27r(T^ M (gt ; C, 0-2 -h a^^ + a-2^p). Figure (left) shows how iit,<^t change 
between spikes for a static 1-dimensional state (o = d = 0). In this case, all terms in 
the filtering equations drop out except those involving gt. The term involving gt in dgt 
pushes gt away from c in the absence of spikes. This effect weakens as \gt — c| grows 
due to the factor gt, consistent with the idea that far from c, the lack of spikes is less 
surprising, hence less informative. The term involving gt in daf increases the variance 
when gt is near c, otherwise decreases it. 

3.3 Information from lack of spikes 

An interesting aspect of the filtering equations (Hi-® is that the dynamics of the posterior 
density between spikes differ from the prior dynamics. This is in contrast to previous 
models which assumed uniform coding: the (exact) filtering equations appearing in [23j 
and [21] have the same form as (§-(§ except that they do not include the correction terms 
involving gt, so that between spikes the dynamics are identical to the prior dynamics. This 
reflects the fact that lack of spikes in a time interval is an indication that the total firing 
rate is low; in the uniform coding case, this is not informative, since the total firing rate 
is independent of the state. 

Figure]^ (left) illustrates the information gained from lack of spikes. A static scalar 
state is observed by a process with rate ([^, and filtered twice: once with the correct value 
of CTpop, and once with cTpop —)■ oo, as in the uniform coding filter of [21]. Between spikes, 
the ADF estimate moves away from the population center c = 0, whereas the uniform 
coding estimate remains fixed. The size of this effect decreases with time, as the posterior 
variance estimate (not shown) decreases. The reduction in filtering errors gained from the 
additional terms in (H)-® is illustrated in Figure (right). Despite the approximation 
involved, the full filter significantly outperforms the uniform coding filter. The difference 
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Accumulated MSE for ADF and uniform-coding filter 



Figure 2: Left Illustration of information gained between spikes. A static state Xt = 0.5, 
shown in a dotted line, is observed and filtered twice: with the correct value cTpop = 0.5 
(“ADF”, solid blue line), and with Up^p = oo (“Uniform coding filter”, dashed green line). 
The curves to the right of the graph show the preferred stimulus density (black), and a 
tuning curve centered at 0 = 0 (gray). Both filters are initialized with fig = 0,(Tg = 1. 
Right Comparison of MSE for the ADF filter and the uniform coding filter. The vertical 
axis shows the integral of the square error integrated over the time interval [5,10], averaged 
over 1000 trials. Shaded areas indicate estimated errors, computed as the sample standard 
deviation divided by the square root of the number of trials. Parameters in both plots 
are a = d = 0, c = 0, ap^p = 0.5, cr^c = 0-1) H = 1, = 10. 

disappears as Upop increases and the population becomes uniform. 

Special cases To gain additional insight into the filtering equations, we consider 
their behavior in several limits, (i) As CTp^p —>■ oo, spikes become rare as the density 
/ (0) approaches 0 for any 0. The total expected rate of spikes gt also approaches 0, and 
the terms corresponding to information from lack of spikes vanish. Other terms in the 
equations are unaffected, (ii) In the limit —)■ oo, each neuron fires as a Poisson process 
with a constant rate independent of the observed state. The total expected firing rate gt 
saturates at its maximum, A°. Therefore the preferred stimuli of spiking neurons provide 
no information, nor does the presence or absence of spikes. Accordingly, all terms other 
than those related to the prior dynamics vanish, (iii) The uniform coding case [53J [21] is 
obtained as a special case in the limit cfp^p —>■ oo with /upop constant. In this limit the 
terms involving gt drop out, recovering the (exact) filtering equations in [23) . 


4 Optimal neural encoding 

We model the problem of optimal neural encoding as choosing the parameters c, Spop, Etc 
of the population and tuning curves, so as to minimize the steady-state MSE. As noted 
above, when the estimate is exactly the posterior mean, this is equivalent to minimizing 
the steady-state expected posterior variance. The posterior variance has the advantage 
of being less noisy than the square error itself, since by definition it is the mean of the 
square error (of the posterior mean) under conditioning by Mt ■ We explore the question 
of optimal neural encoding by measuring the steady-state variance through Monte Carlo 
simulations of the system dynamics and the filtering equations Since the posterior 

mean and variance computed by ADF are approximate, we verified numerically that the 
variance closely matches the MSE in the steady state when averaged across many trials 









(see appendix), suggesting that asymptotically the error in estimating /ij and Ej is small. 


4.1 Optimal population center 

We now consider the question of the optimal value for the population center c. Intuitively, 
if the prior distribution of the process X is unimodal with mode Xqi the optimal population 
center is at JIxo, to produce the most spikes. On the other hand, the terms involving gt in 
the filtering equation (H)-® suggest that the lack of spikes is also informative. Moreover, 
as seen in Figure (left), the posterior variance is reduced between spikes only when 
the current estimate is far enough from c. These considerations suggest that there is a 
trade-off between maximizing the frequency of spikes and maximizing the information 
obtained from lack of spikes, yielding an optimal value for c that differs from Hxq. 

We simulated a simple one-dimensional process to determine the optimal value of 
c which minimizes the approximate posterior variance E(. Figure (left) shows the 
posterior variance for varying values of the population center c and base firing rate 
For each firing rate, we note the value of c minimizing the posterior variance (the optimal 
population center), as well as the value of Cm = argmin^ ((icrt/dt|^j=o); which maximizes 
the reduction in the posterior variance when the current state estimate fj-t is at the process 
equilibrium xq = 0. Consistent with the discussion above, the optimal value lies between 
0 (where spikes are most abundant) and Cm (where lack of spikes is most informative). 
As could be expected, the optimal center is closer to 0 the higher the base firing rate. 
Similarly, wide tuning curves, which render the spikes less informative, lead to an optimal 
center farther from 0 (Figureright). 

A shift of the population center relative to the prior mode has been observed physiolog¬ 
ically in encoding of inter-aural time differences for localization of sound sources [26j . In 
|18j , this phenomenon was explained in a finite population model based on maximization 
of Fisher information. This is in contrast to the results of m, which consider a hetero¬ 
geneous population where the tuning curve width scales roughly inversely with neuron 
density. In this case, the population density maximizing the Fisher information is shown 
to be monotonic with the prior, i.e., more neurons should be assigned to more probable 
states. This apparent discrepancy may be due to the scaling of tuning curve widths in 
m, which produces roughly constant total firing rate, i.e., uniform coding. This demon¬ 
strates that a non-constant total firing rate, which renders lack of spikes informative, may 
be necessary to explain the physiologically observed shift phenomenon. 

4.2 Optimization of population distribution 

Next, we consider the optimization of the population distribution, namely, the simulta¬ 
neous optimization of the population center c and the population variance Epop in the 
case of a static scalar state. Previous work using a finite neuron population and a Fisher 
information-based criterion [18j has shown that the optimal distribution of preferred stim¬ 
uli depends on the prior variance. When it is small relative to the tuning curve width, 
optimal encoding is achieved by placing all preferred stimuli at a fixed distance from the 
prior mean. On the other hand, when the prior variance is large relative to the tuning 
curve width, optimal encoding is uniform (see figure 2 in [18)1. 

Similar results are obtained with our model, as shown in Figure]^ Here, a static scalar 
state drawn from Af{0, Up) is filtered by a population with tuning curve width utc = 1 
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Figure 3: Optimal population center location for filtering a linear one-dimensional process. 
Both graphs show the ratio of posterior standard deviation to the prior steady-state 
standard deviation of the process, along with the value of c minimizing the posterior 
variance (blue line), and minimizing the reduction of posterior variance when fit = 0 
(yellow line). The process is initialized from its steady-state distribution. The posterior 
variance is estimated by averaging over the time interval [5,10] and across 1000 trials for 
each data point. Parameters for both graphs: a = —l,d — 0.5,(TpQp = 0.1. In the graph 
on the left, = 0.01; on the right, A° = 50. 


and preferred stimulus density A/’(c,Up^p). In Figure]^ (left), the prior distribution is 
narrow relative to the tuning curve width, leading to an optimal population with a narrow 
population distribution far from the origin. In Figure]^ (right), the prior is wide relative 
to the tuning curve width, leading to an optimal population with variance that roughly 
matches the prior variance. When both the tuning curves and the population density 
are narrow relative to the prior, so that spikes are rare (low values of dpop in Figure 
(right)), the ADF approximation becomes poor, resulting in MSEs larger than the prior 
variance. 


5 Conclusions 

We have introduced an analytically tractable Bayesian approximation to point process 
filtering, allowing us to gain insight into the generally intractable infinite-dimensional 
filtering problem. The approach enables the derivation of near-optimal encoding schemes 
going beyond previously studied uniform coding assumptions. The framework is presented 
in continuous time, circumventing temporal discretization errors and numerical impreci- 
sions in sampling-based methods, applies to fully dynamic setups, and directly estimates 
the MSE rather than lower bounds to it. It successfully explains observed experimen¬ 
tal results, and opens the door to many future predictions. Future work will include a 
development of previously successful mean field approaches m within our more general 
framework, leading to further analytic insight. Moreover, the proposed strategy may lead 
to practically useful decoding of spike trains. 
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Steady-state variance, narrow prior 



Steady-state variance, wide prior 



Figure 4: Optimal population distribution depends on prior variance relative to tuning 
curve width. A static scalar state drawn from A/'(0, a^) is filtered with tuning curve 
(Ttc = 1 and preferred stimulus density A/'(c, tTpop). Both graphs show the posterior 
standard deviation relative to the prior standard deviation Cp. In the left graph, the 
prior distribution is narrow, = 0.1, whereas on the right, it is wide, = 10. In both 
cases the filter is initialized with the correct prior, and the square error is averaged over 
the time interval [5,10] and across 100 trials for each data point. 


A Appendix 

A.l Derivation of the ADF filtering equations for linear dynamics 

A. 1.1 Setting and notation 

In the main text, we have presented our model in an open-loop setting, where the process 
X is passively observed. Here we consider a more general setting, incorporating a control 
process Ut, so the dynamics are 


dXt = {A (Xt) + B [Ut)) dt + D [Xt) dWt, 


(9) 


where, in general, Ut is a function of Aft- 

For the purposes of the derivation, it is convenient to work with precision matrices 
rather than variance matrices. We write F = Ep^p,!? = St-' and Qt = Sj Thus the 
density of the process N at [t, 9) given X[Q^t] is 


At [9,Xt) = X 


= A“ 


0 / 1^1 


(27 


exp 


-l\\9-c\\l-^-\\HXt-9\\l,, 


( 10 ) 


' \F\ 
( 2 ^)” 


exp 


-^\\HXt-c\\l,-^ 


0-{F + A)-' [Fc + RHXt) 


F+R 

( 11 ) 


where M = [F~^ -f R~^) 

We denote by P (•, t) the posterior density of Xt given Aft, and by Ep [•] the posterior 
expectation based on observations up to time t. We will simply write Ep [•] when the 
time t is obvious from context. 
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A.1.2 Filtering equations between spikes 

Exact filtering equations for the first two moments As seen in [^, the PDE for 
the posterior density, 


dtP {x,t) = ClP {x, t)dt + P (x, t) 


Xt {0,x)-Xt (9) 

At m 


(a (dt X d0) - At (0) d9 dt 


( 12 ) 

still holds in the closed-loop case. Here Ct (appearing in place of C in is the posterior 
infinitesimal generator, dehned with an additional conditioning on Aft, 


Ctf (x) = lim 
At-tO+ 


E[/ {Xt+At)\Xt = x,J^t]- fix) 
At 


and Cl is its adjoint. Note that in this closed-loop setting, the infinitesimal generator is 
itself a random operator, due to its dependence on past observations through the control 
law, and that Nt is no longer a doubly-stochastic Poisson process. 


Between spikes, (12) simplifies to 


(a., t) = CIP {x, t)-P {x, t) (a* (0, x) - At (0)) dO, 


so for a sufficiently well-behaved function /, 

dPp [f{Xt)] 
dt 


f {x) [CIP {x, t) + P [x, t) J l^Xt (9) - Xt {9, x)j d9j dx 
P{x,t) (Ctf{x) + fix) [ (Xt (6») - At i9,x)) dd^ dx 


= Ep 


Ctf (At) + / (At) / At i9) - At i9, Xt) d9 


Assuming the state evolves as in the (closed loop) infinitesimal generator is 


Ctf ix) = (A ix) + B (C/t))^ V/ ix) + ^Tr VV (x) D (cr) D ix)^ 


so, letting fj.t = EpAt, At = Xt - fit,Ct = Ep 
dtit 


dt 

dSt 

dt 


= Ep[AiXt)]+BiUt)+Ep 


XtXT 


At / {Xti9)-Xti9,Xt))d9 


— Ep 
-l-Ep 


A (At) A, 


+ E/ 


At A (At) 


■ Er 


P(At)P(At)' 


AtAf / {Xti9)-Xti9,Xt))d9 


(13) 


which is the same as Q, with an additional term B (Pt)- 
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ADF approximation The computations that follow frequently require multiplying 
Gaussian functions, sometimes with a possibly degenerate precision matrix. To this end, 
we use the following slightly generalized form of a well-known result about the sum of 
quadratic forms. 

Claim. Let x,a,b G M" and A,B G be symmetric matrices such that A -f S is 

non-singular. Then 


||x- a||^ 


^IIb — A(A+B)-^ B 


X — {A + B) ^ {Aa -\- Bb) 


2 

A+B ' 


Proof. This is proved by a straightforward completion of squares. If B are invertible, 
\\x - a\\\ + llx - 6||^ = ||a;||^ - x'^Aa - Ax + ||a||^ -I- ||a;||g - x'^Bb - b'^Bx + ||6||£ 

= \\x\?A+B - + Bb) - (Aa + Bb)^ x + \\a\\\ + Ml 


x—{A + B) ^ {Aa + Bb) 
X — {A + B) ^ {Aa + Bb) 


A+B 

2 

A+B 


{A + B)~^ {Aa + Bb) 

+ II«IIa + li^lls “ 11"^® + 


A+B 
2 


* = IkllA + ml - WMIU+b)-^ - mA (A + B)-^ Bb - b^B (A + B)-^ Au - 115511^^+3)-, 

= a^A (a-{A + B)~^ Aa-{A + B)~^ Bb^ + b^B (b-{A + B)~^ Bb-{A + B)~^ Aa^ 
= a^A {A + B)~^ B{a-b) + b^B {A + B)~^ A{b-a) 

= a^ (^B~^ + A~^') ^ {a — b) + b^ [A~^ + B~^) ^ {b — a) 

= II® ~ 

= II® ~ ^IU(A-|-B)-iB 

By continuity, the claim also holds when A,B are not both invertible, provided {A -|- B) 
is invertible. □ 



Computing the expectations in (131 involves computation of integrals containing 
P {x, t) At {9, x). Taking the ADF approximation P {x, t) « V (a;; fitMt), and using the 
claim above, we have 


P{x,t)Xt{9,x) « A° 


I |F| 

(27r)” 


1 


Af {x; [-^ [\\Hx - c\\l[ + 6 - {F + R) ^{Fc + RHx) 


\-i 


F+R 


= A' 


0 


\F\ 


(27r) 


X exp 

A° I 


m+n I 

1 
2 


• exp 




\HTMH 


9 - {F + R)~^ {Fc + RHx) 


F+R 


|J^I 


(27r)’"+” |St 
1 


exp I -- [M - cIIqm + ||x - Aif ||q^+3tmb 


xexp 1 -- 


9 - {F + R)~^ {Fc+ RHx) 


F+R 
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where ^ is any right inverse of H, and 
c = H-^c 

M = F {F + R)~^ R = (F-^ + R-^y^ 

= (Qt+H^MHyyQtfXt + H^MHc) 
gf A QyQ^ +H'^Mny^ H’^MH = {I + H^MHJ^ty^ H^MH. 

An alternate form for may be derived from the Woodbury identity as follows, 

gf = [l + H'^MHY.ty^ H'^MH 

= {M-^+HY,tHy~^ HY.t^H'^MH 

= {l - {M-^ + MH 

= MH 

= H^syn, (14) 


where 

so we can write 


sy = (M-i + H^tHy , 

P{x,t)Xt{e,x) « ^ 2 ny+l (~^ ~ ^11V + 11^ ~ (^^Wq.+htmh)) 


xexp 1 -- 


0- (F + (Fc+RHx) 


F+R 


Now we define 


gt = J Xt (0) de = J Ep [Xt {9, Xt)] de, 

and compute its value as follows: 
gt = fXt (9) d9 


J J P{x,t) Xt {9,x)dxd9 


I det F 
(27r)™+" det Et 
1 


exp ^“2 II^Mi - c|||m^ j dxexp^- 


1 II _ M||2 

2 IF UQt + H^MH 


X I d9 exp ( — - 


- {F + R)~^ {Fc + RHx) 


= 


det F 


(27r)” det S* det {F + R) 


exp -V \\Hg,t - c\ 


F+R 
2 




1 


(ixexpl ^\\x g,t \q^_^.htmh 


= X^ 


I det F 

det St det {Qt + H^MH) det {F + R) 


1 2 
exp ( “2 \\Hfit-c\\gA 
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To simplify the expression under the square root, we note that 


det F 


det (F + R) 


det {f {F + = det + F-^) ^ 


detM 
det R ’ 


and, using the matrix determinant lemma and (14) 


det {Qt + = det Qt det M det (M ^ + FlY^tH^) = 


detM 


det S, det 


so 


,0 /detS'f / 1,,^^ „2 


A similar computation yields additional terms from (13), expressed in terms of gt- 


Ef 


Xt J Xt{9,Xt)d9 = J dx J d9xP {x,t) Xt {9,x) 


det F 

\ m+n 


y (27r)"‘^"detEt 
X J dxxexp - ||x —/i( 


exp ( \\Hgt-c\\lA 


X / d 0 exp ( —- 


M||2 

UQt+HTMH 
\-l 


6»- (F + i?)”" (Fc + FFcc) 


= A 


det F 


exp ( “2 II^Mt - c| 


F+R 
2 


= A' 


(27r)"' det E* det (F + R) 

X J dxxexp ^—-\\x — Wq^+h^mh 
det F 


Si 


M 


det Et det {Qt + H^MH) det (F + R) 


exp ( \\Hgt-c\\% 


= 


15 




















Ep 


XtXj / \t{e,Xt)de 


dx J d9P (x, t) {x — fit) (x — fit)^ At (0, x) dO 

■exp \\H fit - c\\^sf^ 


det F 


(2tt) 


m+n 


det Et 


X J dx {x - fit) {x - fit)^ exp (^-^\\x- 

2 

F+R, 


X I dO exp ( — - 


' - {F + R)~^ {Fc + RHx) 


= X 


det F 


■exp ( - i \\Hfit-c\\lM 


(27r)" det Et det {F + R) 

X J dx {x - fit) {x - fit)^ exp (-^\\x-fif^Wg^^jjTMH 


= 


det F 


= gt 


I det Et det (Qt + Fl'^MH) det (F + R) 

xp IlF^t - c|||m) [{Qt + H^MH)-" + {fit - fi^) {fit - fiff 
{Qt + H^MH)-^ + {fit - fif) {fit - fifY 


Assuming X has linear dynamics, substituting these results into yields the follo^wing 
filtering equations between spikes (we abuse notation and use fit, Et to refer to the ADF- 
approximate quantities from here on), 

= Agt + B {Ut) + gt {fit — fif^), 

^ = AEt + EtA + FF^ 

at 


+9t 


Et - {Qt + H^MH) - {fit - fif) {fit - fi^) 


(15) 


We simplify this by computing 


fit - Aif 


fit - {Qt + H^MH)~^ {Qtfit + H^MHc) 
{Qt + H^MH)~^ H^MH {fit - c) 

Etgf {fit - c) 

Y^tH^Sf {Hfit - c ), 


Yt- {Qt + H^MH)' 


= E, 


{l - {I + H'^MHYt) 


YtH"^ {M-^ + HYtH^) ^ HYt 

?fSt, 




where we have used the Woodbury identity and (14). Substituting into (15) we obtain 
the form 
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djH 

dt 

dSt 

dt 


A^t + B (Ut) + {HfJ't ~ c) 


AEt + EtA + DD^ 


Agt 


SfH^t - Sf {Hgi-. 


c) {H^Mt - cf SfH^t 


(16) 


A.1.3 Effect of spikes 


= P(x,t 


When a spike occurs at time t in preferred location 6, the update according to (12) is 
p{x,t+) = p + P {x,t-) 

At- {e,x) 

At- (9) 

_ P{x,t-)Xt- (9,x) 
f P (x,t~) Xf- (9,x)dx 

To compute this sum we note that P {x,t) At {9,x), under the ADF approximation, may 
be written as a single Gaussian in x, 


P {x,t) Xt {9,x) dx « A° 


= A*^ 


detF 


(27r) 


m+n 


det Sy 


exp ( -^ llx - gt\\Q^ “ ^ 11^ “ - \ - ^Wr 


detF 


(27r) 


m+n 


det Sy 


exp \\9 - cfp -\\\x- Mtllg, “ ^ 


= Ct (6») • exp ( - - x-{Qt + H'^RH) [Qtfit + R9) 


Qt+RTRH 


where 


Ct{9) = A° 


det F 


(27 


)'"+” det E 


exp ( \\9-c\\l 




<9f = Qt {Qt + H^RH) ^ H^RH = (/ + H^RHY,t) ^ H'^RH. 
Analogously to the computation for above, we have Qf = S^H, where 

Now P (x, is given by the normalized Gaussian, 

P(x,t-)Xt~ (9,x) 


P [x, G 


/ P {x, t~) At- {9, x) dx 


det (E-i -t RTRH) 


expl-- 


- (St- -f H'^RH) ^ (S-+t- + H'^R9) 


T.p+H'i'RH 


N (x, (S+ -h H'^RH) ^ (E-Vt- + H'^R9) , (E+ -h H'^RH) , 
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and the update is 


^ {Y.-}lit- + Re) 

Yt+ = {Y-} + RH)~^. 

To incorporate these updates into the inter-spike SDE they can be cast in the form 

+ {Y-} + h'^rh)~^ h'^rh {H-^e - nt -) 

= fit- + {Hr ~ e-t-) 

= fit- +Yt~H^S^-i0-Hfit-), 

Yt+ = Yt- - {Y-} H'^RH)~^ H'^RHYt- 

= Yt--Yt-Q^-Yt- 
= Yt--Yt-H'^S^-HYt-, 

giving the full filtering SDE 


dfit = Afitdt + B{Ut)dt + gtYtH^St^ {Hfit-c)dt + Yt-H'^S^- [ {9 - Hfit-) N {dt x dO), 

dYt = (^AYt + YtA + DD'^ + gt [YtH^Sf HYt - YtH^S^ {HfXt - c) {Hfn - cf SfHYt] ) 


-Yt-H^ S^.HYt-dNf 


dt 

(17) 


A.2 Non-linear dynamics 

In case of non-linear dynamics 


dXt = {A (Xt) + B {Ut)) dt + DtdWt 


the ADF approximation may also be applied to the terms involving A{Xt) in (13). As¬ 
sume the *-th element of A, is given by a power series around fit, written in multi¬ 
index notation, 

AW (x)=^AW ifit)ix-fitr, 


where the sum is over all multi-indices a. Then, assuming the ADF approximation 

Xt^J\fifit,Yt), 


Ep 


A«(Xt) =^AW(/idE„(Sd, 


where E^ (E) is defined as E {Z°‘) = E J([j, for Z ~ A/'(0, S), and may be computed 
from Isserlis’ theorem. Similarly, 


Ep 


A{Xt)X^ 




Ep [a(*) {Xt) (x[^'> - fii^^) 

{fit)E^+,^{Yt), 


where Cj is j-th standard basis vector (the multi-index corresponding to the single index 
])■ 
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Writing and = (E„+ei (^t) ,■■■, Ea+e„ (^t)) the filtering 

equations become 

d^t = Ag (nt) Eg (St) + B {Ut) dt + 

oc 

gtY.tQt^ {^it-c)dt + Y.t~Qf-. f {H~^e - g.t-) N {dt x dO) 
dEt = ( ^ (a^ {lit) K,t + Eo.i^a (/Xt)^) + DD'^ + gt htQ^^t - {lH - c) (/x* - cf Qf^St 


-Et-Qf.St-dNf 

Analogous comments apply when the noise gain Dt is a non-linear function D (At), pro¬ 
vided each element D {x) D (x)^ may be expanded as a power series. 


A.3 Comparison of estimated posterior variance and MSE 

In the main text, we studied optimal encoding using the posterior variance as a proxy for 
the MSE. Letting /Xt,St denote the approximate posterior moments given by the hlter, 
the MSE and posterior variance are related as follows. 


MSEt = 


E 


tr (At 


Mt) (At — /xt)^ 


EE‘ptr(At-Mt)(At-AXt)^ 


= E[tr(Var^At)] -f E 


tr {gt - E*pAt) {gt 



where Ep [•], Varp [•] are resp. the mean and covariance conditioned on A/), and tr is the 
trace operator. Thus for an exact filter, having gt = EpAt, Et = CovpAt, we would have 
MSEt = trace[E(Et)]. Conversely, if we find that MSEt « trace[ESf], it suggests that the 
errors are small (though this is not guaranteed, since the errors in gt and Et may effect 
the MSE in opposite directions, if the variance is underestimated). 

Figure shows the variance and MSE in estimating a linear one-dimensional pro¬ 
cess, after averaging across 1000 trials. Although the posterior variance is, on average, 
overestimated at the start of trials, in the steady state it approximates the square error 
reasonably well. 

We also show here variants of the Figures and (Figures and respectively), 
showing the MSE rather than the variance. The results look similar but noisier, except 
in Figure [7b| for small population variance, where the ADF estimation is poor due to very 
few spikes occurring. 
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Sensory parameters are c = OjCTp^p = 0.1, = 0.01, A° = 10. The means were taken 

across 1000 trials. Shaded areas indicate error estimates obtained as sample standard 
deviation divided by square root of number of trials. 



Figure 6: Mean Square Error as a function of model parameters. This figure is based 
on the same data as Figure with Root Mean Square Error (RMSE) plotted instead of 
estimated posterior variance. See Figure]^ for more details. 
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curve width. This figure is based on the same data as Figure]^ with MSE plotted instead 
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